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. . i be finite and of fixed size S > 2. The Longest Common described in Jl 

Subsequence problem, which we shall refer to as the LCS Another, less obvious motivation for the study of this 

problem, consists of finding a sequence of letters which problem comes from the fact that it can be formulated as 

appears as a subsequence of both X and Y, and which is of a model of directed passage time percolation on a two di- 

maximal size. Equivalent^ one can ask for two sequences mensional (triangular) lattice f2§0. T ° see this, consider 

1 < ii < ... < i L < N and 1 < ji < ... < j L < M such the d i rected lattice whose vertices are the integer points 

that X lk = Y jk , 1 < k < L and L is maximal. ^ < i < TV, < j < M and whose edges are the bonds 

The length of a LCS can be viewed as a natural mea- formed by nearest neighbors together with the bonds of 

sure of the "proximity" of different strings of letters, ft is the form {(i - 1, j - 1), (ij)}, 1 < i < N, 1 < j < M, all of 

an example of the "best sequence alignments" which are these bonds being oriented according to the positive direc- 

of use in biology, in tests for commringlong molecules tion of the axes. To each bond between nearest neighbors 

such as proteins and nucleic acids |J,|lJ,|o) . attach the weight 0, and to each bond {(i -1J- 1), {ij)} 

attach the weight <5x;,y, , that is 1 if Xi = Yj, and to 

Send offprint requests to: J. Boutet de Monvel otherwise. Define the weight of any path on this lattice to 
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Abstract. Given two strings X and Y of N and M characters respectively, the Longest Common Sub- 
sequence (LCS) Problem asks for the longest sequence of (non-contiguous) matches between X and Y. 
Using extensive Monte Carlo simulations for this problem, we find a finite size scaling law of the form 
E(Ln)/N — 7s + As/(ln Ny/N) + ... for the average LCS length of two random strings of size iV over S 
letters. We provide precise estimates of 7s for 2 < S < 15. We consider also a related Bernoulli Matching 
model where the different entries of an TV x M array are occupied with a match independently with proba- 
bility 1/5. On the basis of a cavity-like analysis we find that the length of a longest sequence of matches in 
that case behaves as L NM ~ 7^ {r)N where r = M/N and 7#(r) = (2 V / r5-r - l)/(5- 1). This formula 
agrees very well with our numerical computations. It provides a very good approximation for the Random 
String model, the approximation getting more accurate as 5 increases. The question of the "universality 
class" of the LCS problem is also considered. Our results for the Bernoulli Matching model show very 
good agreement with the scaling predictions of [[l2| for Needleman-Wunsch sequence alignment. We find 
however that the variance of the LCS length has a scaling different from Var(LN) ~ JV 2//3 in the Random 
String model, suggesting that long-ranged correlations among the matches are relevant in this model. We 
finally study the "ground state" properties of this problem. We find that the number Nlcs of solutions 
typically grows exponentially with N. In other words, this system does not satisfy "Nernst's principle". 
This is also reflected at the level of the overlap between two LCSs chosen at random, which is found to be 
self averaging and to aproach a definite value qs < 1 as N — > 00. 

PACS. 75.10.Nr Spin glass and other random models - 02.60.Pn Numerical optimization 

1 Introduction ft is also an important problem in computer science, 

as the length of a LCS of two strings is closely related 
to the number of editing operations (insertions/deletions) 
Let X = (Xi, ...,Xjy) and Y — (Yi, Ym) be two strings which are necessary to transform one string into the other 
of characters. Here the X,'s and T/s are letters of a given (the so called "string-edit" distance) |§]. A large number 
alphabet, which will be assumed throughout this paper to f variants and applications of the LCS problem are also 
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be the sum of the bonds' weights along the path. Then 
clearly a LCS between X and Y may be constructed from 
any directed path of maximum weight joining the point 
(0,0) to the point (N,M). If we interpret the weight of 
a bond as a time required for the passage of that bond, 
we seek the maximum rather than the minimum passage 
time from (0,0) to (N,M), but this is of no significance 
here. 

This paper is concerned with the stochastic version of 
the LCS problem, where one is given very long strings 
the letters of which are chosen at random, independently 
and uniformly in a given alphabet of size S. This problem 
has retained much attention 0Jl^,^l| (see also for a 
recent review). The main issue is to understand the large 
N behaviour of the LCS length of the N first letters of 
X and Y. Let Lm be this number. Observing that the 
sequence (Lat) is superadditive (Ln 1 +n 2 > Ln 1 + £at 2 ), 
and using the martingale difference method, one can prove 
in an elegant way |22| that with probability one (for infi- 
nite strings), Ljy is asymptotic from below to 7s N, where 

< 7s < 1 is a constant whose exact value is unknown. 
It has also been proved 0,|l^| that the rate of convergence 
of the expected ratio E(L^)/N to 75 is at least as fast as 

0(y/\llN/N). 

In the passage time percolation picture the weights at- 
tached to the bonds are correlated random variables (for 
example the occupation numbers of the matches on the 
corners of any rectangle of the lattice are obviously cor- 
related). We consider also a related model where each 
bond {(i — l,j — l),(r/)} is given a weight 1 (resp. 0) 
independently of the others with probabitlity 1/5 (resp. 

1 — l/S). We shall refer to this model as the Bernoulli 
Matching model, and denote by the maximum weight 
of a directed lattice path joining (0, 0) to (./V, iV) (equiva- 
lently is the maximum L for which there are sequences 
1 < h < ... < i L < N and 1 < ji < ... < Jl < N such 
that (ik,jk) is a match, 1 < k < L). We let 7J? be the limit 
limAr^oo Lfj /N, which is shown to exist a.e. in exactly the 
same way as for 75. Also we note that Alexander's rate 
result applies to E^L 1 ^) as well. 

Much effort have been made to get bounds on 7s |ic],|| , 
but there are still non negligible gaps between the known 
upper and lower bounds ||. Estimations of 75 based on 
numerical simulation are also available [ S|Jl],|30f but appar- 
ently no attempt has been made to determine numeri- 
cally the finite size corrections to the linear scaling law 
E(L N )~ ls N. 

This paper presents the results of extensive Monte 
Carlo simulations for the LCS problem, showing that the 
difference 7s N — E(Ln) has a well defined asymptotic 
behaviour, allowing one to get precise estimates of 75 by 
extrapolation. The same finite size scaling law appears to 
hold for the Bernoulli Matching model, and we have ob- 
tained corresponding estimates for 7^ . 

We further considered the case where the strings X 
and Y are of different sizes, N ^ M . The relevant case 
occurs when N and M are large but comparable, namely 
N, M -> 00, the ratio r = M/N being fixed (r > 0). Let 
Ln{t) = iAr,[ r AT] be the length of a LCS of Xi, Xpj and 



Yi, Yr r j\n. Then with probability 1, one has 
limAr^oo L N (r)/N = 7s(r) where < 7s(r) < 1. Of 
course 75 (1) = 75, and the function 7s (r) has the ob- 
vious symmetry property 7,5 (1/r) = l/r7s(r). In the pic- 
ture of directed percolation r is given by tan(-7r/4 + (f>) 
where (f> E [— 7r/4,7r/4] is the angle between the direction 
of interest and the first bisector, and the object of interest 
is the set of points which are "wet" at time t, defined here 
to be the set Ct = {(ij) ■ < t}. As t — ► 00 (for N and 
M infinite) the set Ct/t is asymptotically delimited by 
the curve of polar equation p(<p) = ^/l + r(<^>) 2 /7s(r(0)). 
The above symmetry property reflects the fact that C t is 
asymptotically symmetric with respect to the first bissec- 
tor. A percolation transition occurs in this problem when 
r = r c = S, namely 7s (r) = 1 for r > S while 7s (r) < 1 
for r < S. By symmetry we have another transition at 
r = l/r c = l/S, such that 7s(r) = r for r < l/S and 
7s( r ) < r for r > l/S. Analogous comments apply to 
the Bernoulli Matching model. In that case we provide 
a simple analytic expression for the corresponding func- 
tion 7^ (r) , which is derived (see section (||) below) on the 
basis of a cavity-like analysis of the LCS problem. The 
cavity method is an approximation scheme generally con- 
sidered to be appropriate for describing the mean field 
theory of disordered systems (such as spin glasses) [p~S| . 
The Bernoulli Matching model is not a mean field model 
however, but really a two dimensional percolation model, 
and by "cavity" we mean the following: First, the proper- 
ties of the system can be computed by use of a recursion 
formula. This is equation (|l]) given below, which is valid 
for the Random String model as well as for the Bernoulli 
Matching model. Second, a decorrelation, or "clustering" 
property jl5| happens to hold in the Bernoulli Matching 
model, allowing the recursion formula to be solved at large 
N, M by use of a self-consistent approximation. This leads 
to an expression of 7J? (r) in very good agreement with our 
numerical results. 

We finally investigated the "configuration space" prop- 
erties of this problem, which are most easily accessible by 
constructing what we call the LCS graph of given strings 
X and Y. This structure is defined in section (0). It can be 
computed in a very efficient way, and it gives a direct ac- 
cess to properties of the set of LCSs of X and Y, enabling 
one to compute such quantities as: 

i) The total number Nlcs of LCSs of X and Y. 

ii) The average overlap between two LCSs chosen at 
random among the set of LCSs of X and Y . 

iii) The distribution of the distance between two suc- 
cessive matches in a LCS. By distance we mean here the 
Manhattan distance \i\ — ii \ + |ji — ji\ for given points 
(ilii), {iih)- 

iv) The mean square "displacement" with respect to 
the first bissector, of the matches along a LCS, where (fol- 
lowing [^2)) the displacement coordinate of a point (ij) is 
defined to be i — j. 

These type of computations are of interest because 
they provide informations on the structure of the set of 
solutions which in other systems may be very difficult to 
obtain. For example our computations show that typi- 



J. Boutet de Monvel: Extensive Simulations for Longest Common Subsequences 



3 



cal random strings have many common subsequences of 
maximum length. Their number typically grows exponen- 
tially with N, i.e. the ground state entropy of this sys- 
tem is not zero. We provide estimates of this entropy 
and of the typical overlap between two randomly chosen 
LCSs for several values of S. Properties such as iv) are 
of physical interest as they depend on long-ranged cor- 
relations among the matches in a LCS, and they charac- 
terize the "universality class" of the LCS problem. This 

§:stion has been recently analysed by Hwa and Lassig 
who showed that the percolation formulation of the 
LCS problem (and more generally of Needleman-Wunsch 
sequence alignment described below) can be treated in 
the continuum limit as a model of directed polymer in a 
quenched random medium. In this analogy each directed 
path on the above defined lattice is assigned an energy 
— W, where W is the weight of the path. The statistics 
of these paths is taken to be the Boltzmann-Gibbs dis- 
tribution The LCS length then corresponds to the 
ground state energy of the "bridge" from (0, 0) to (N, M). 
In the case of the Bernoulli Matching model, this leads 
to a complete characterization of the universality class of 
the model: The continuum limit is described by the well 
studied 2D-directed path (or equivalently the ID-random 
walk) in a Gaussian random potential. The fluctuations 
of L N and of the "displacement" i — j along the optimal 
paths are governed by exactly known universal exponents 
U = 1/3 and ( — 2/3 respectively. Hence Var(L N ) should 
grow asymptotically as N 2 / 3, and the mean square dis- 
placement as iV 4 / 3 . Our numerical results agree very well 
with these predictions. The question of the universality 
class of the Random String model is more subtle, as this 
model involves long ranged correlations in the disorder. 
Hwa and Lassig provide evidence that these correlations 
are not relevant in the continuum limit for a range of the 
defining parameters of Needleman-Wunsch alignment. In 
the regime corresponding to the LCS problem (which was 
not considered in [jl2| ) , our results only partly supports 
the above predictions: The measured mean square dis- 
placement for the Random String model show no devi- 
ation from the superdiffusive TV 4 / 3 scaling. The behaviour 
of Var(Lpf) is close to, but significantly different from 
iV 2 / 3 , suggesting that correlations among the matches in 
the Random String model are relevant to the universality 
class of the LCS problem. It should not be considered a 
surprise that the scaling relation oj — 2£ — 1 appears inval- 
idated by our results. This scaling relation is known to be 
intimately connected with Galilean invariance |]l4| . In the 
formulation of the Random String LCS problem as a ID- 
random walk, long-range temporal correlations are present 
in the random potential, and Galilean invariance is bro- 
ken. What is surprising is that only the fluctuations of the 
ground state energy show a scaling affected by these cor- 
relations. This is left to the reader as an interesting open 
question. 

We close this introduction by explaining the position 
of the LCS problem with respect to sequence alignment 
methods in molecular biology. The purpose of these meth- 
ods is to provide efficient tools for the detection of relevant 



similarities among DNA molecules or among proteins. Rel- 
evance refers here to finding the functional and evolution- 
ary relationships between these molecules, and is a main 
biological issue. This problem is the source of a rich inter- 
play between biology and computational sciences (see [^9| 
for reviews). Even if determining what is the "best align- 
ment" of two sequences for biological purposes remains 
in part a matter of art, standard comparison algorithms 
are widely used by biologists. These algorithms are very 
useful to confront a newly discovered DNA molecule or 
protein to the huge existing databases of known molecules 
(and then to infer the possible functional properties of the 
new molecule). The LCS problem corresponds to a class of 
alignment algorithms discovered by Needleman and Wun- 
sch [|l(| , which provided the first systematic tool for taking 
into account the insertions and deletions which naturally 
occurs in the evolution of biological sequences. To describe 
this approach consider again the percolation formulation 
of the LCS problem. An alignment of the strings X and 

Y is viewed as a directed path on the the lattice defined 
above, tracing a possible "evolution" from X to Y: Each 
diagonal bond (ending at (ij)) on the path represents a 
substitution of the letter Yj to the letter X{ (if (ij) is 
a match Xi is left unchanged). Horizontal and vertical 
bonds represents respectively deletions and insertions, also 
termed as indel operations, or "gaps". In this way each 
directed path from (0, 0) to (N, M) corresponds to a well 
defined sequence of edit operations transforming X into Y 
(or equivalently Y into X), which is the usual definition of 
an "alignment". A given path 7 is assigned a score W(*y), 
which is defined (in the simplest version of the model) by 
weighting each substitution along 7 with a matching func- 
tion s(Xi,Yj), and each gap with a penalty —5(5 > 0). 
A common choice for s(Xi,Yj) is to assign a score 1 to 
a match Xi = Yj and a penalty — /z (/j, > 0) to a mis- 
match Xi 7^ Yj. The optimal alignments are determined 
by maximisation of the score W(j). We are then facing a 
longest path problem very similar to the LCS problem. In 
particular the optimal score Wnm from (0, 0) to (N, M) 
can be computed in an efficient way using a straightfor- 
ward adaptation of the dynamic programming algorithm 
of section (|^). Needleman-Wunsch sequence alignment is 
a global alignment method, since the whole strings X and 

Y are aligned together. The optimal alignments are in- 
variant by multiplying the matching function and the gap 
penalty by any positive constant. Moreover the numbers 
iV+, N-, and N g of matches, mismatches, and gaps re- 
spectively along any directed path from (0, 0) to (N, M) 
are related by 2N+ + 2A_ + N g = N + M. Hence with 
the above choices the number of independent parameters 
is reduced to one: It is equivalent to maximize 1^(7) = 
N + - 11N- - 5N g or to maximize ^(7) = N+ - eN g , 
where e = (5 - (i/2)/(l + it). As N, M -> 00 the modified 
optimal score behaves as Wnm ~ a(e,r)N (r — M/N), 
where a(e, r) is a monotonous decreasing (demonstrably 
continuous) function of e. For e < —1/2 the problem is 
trivial and Wnm = ~(N + M)e. When -1/2 < e < 0, 
it is always advantageous to change a mismatch for two 
gaps (an insertion followed by a deletion). We may then 
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assume 2N + + N gaps = N + M and the problem reduces 
to maximizing iV+, i.e. to the LCS problem. In this region 
a(e,r) interpolates linearly from its value at e = —1/2 to 
its value at e = 0. The case e = corresponds exactly to 
the LCS problem: Mismatches and gaps are then equiva- 
lent as regards to the score. Since gaps and mismatches are 
known both to occur during evolution, and are not equiv- 
alent energetically, the biologically relevant region clearly 
lies within e > 0. Hence the LCS problem represents a nat- 
ural (even if unrealistic) limit case of Needleman-Wunsch 
sequence alignment. It must be pointed out that for biolog- 
ical purposes (in particular for detecting weak similaritites 
between rather remote sequences), local rather than global 
alignment is often required. A powerful approach to local 
alignment is Smith- Waterman algorithm []20f| , which max- 
imizes the score W{^) over all pairs of substrings (i.e. con- 
tiguous segments) of X and Y. In the percolation picture, 
the end points of the paths associated with local align- 
ments are no longer fixed. The gap and mismatch penal- 
ties are then really different parameters and strongly influ- 
ence the optimal alignments. In fact for random sequences 
Smith- Waterman alignment undergoes a phase transition 
from global to local alignment &M: For small S and /i, 
more precisely as long as S and /i are such that the opti- 
mal score Wnm obtained by global alignment is positive, 
we recover essentially Needleman-Wunsch alignment: For 
large N, M, the optimal Smith- Waterman score Hnm sat- 
isfies Hnm ~ Wnm with high probability. Note that the 
case 6 = reduces as before to the LCS problem. For 
sufficiently high gap and mismatch penalties, global align- 
ment leads to a negative score Wnm growing linearly with 
N, M in absolute value. A positive score can be achieved 
only by small paths taking advantage of the local fluc- 
tuations in the density of matches. This is the genuinely 
local phase, where Hnm grows only logarithmically with 
N, M. Clearly the LCS problem is no more relevant to this 
phase. For example the exponential proliferation of solu- 
tions occuring in the LCS problem, relevant to the global 
phase, is replaced in the local phase by a small number 
of well characterized optimal and suboptimal alignments 
|f27f . The transition line between the global and the lo- 
cal phases, which separates the regions of positive and 
negative linear growth of the global score Wnm , is easily 
determined from the knowledge of a(e,r) defined above. 
Interestingly, the neighborhood of this transition line is 
found empirically to be a most relevant (5, /Lt)-region for 



biological purposes 
provides some valua 



Hence the value a(0,r) = 7s(r) 
j\c information thanks to the mono- 
tonicity of a(e, r). More importantly, even if the biological 
relevance of purely global alignments is for the present 
difficult to address, clearly it is of interest to understand 
their statistical properties [jllj. As the LCS problem corre- 
sponds in some sense to the "most" global case of sequence 
alignment, it deserves particular attention. 



2 The average length of a Longest Common 
Subsequence. 

There are several algorithms for computing the LCS length 
of two strings X = (X 1: Xn) and Y = (Yi, Ym)- 
The best known is based on a dynamic programing ap- 
proach as follows. For i,j > 1, let L^ be the length of 
a LCS of (Xi, ■■■,X i ) and (Yi, —,Yj). We call the matrix 
(Lij) the LCS matrix of the given instance. The strat- 
egy consists of using the fact that Ly can be readily 
computed if £$— ij and are known. In- 

deed one has L^ = Li_ij_i + 1 when Xi 
Lij = max(Li_ij, when Xi ^ Yj. In short 



Yj, and 



= max(Li 



-i,i> -^ij-i 



, Li-x,j-i + 8xi,Yj)- (!) 



This recurrence relation, with the obvious initial condi- 
tions Lifi = Lqj = 0, provide a very simple and efficient 
way to compute the LCS matrix of X and Y. This algo- 
rithm arises also naturally in the passage time percolation 
picture. Indeed the LCS problem, viewed as a longest di- 
rected path problem as described above, has a natural for- 
mulation as a linear programming problem. Relation (|l]) 
is nothing but the solution to the dual program, which is 



(2) 



for given 
straints 



min Lnn — L m 
numbers Lij,0 < i,j < N subject to the con 



Hj > •C'j-ijj-i + &Xi,Yj , Lij > Li-ij,Lij > Lij-i, 



l<i,j<N, (3) 



and L 



f.O 



J 0,j 



0. 



The time required to compute the LCS matrix of X 
and Y using ([!]) is given essentially by the product NM. 
Of course the whole LCS matrix contains more informa- 
tion than needed to construct a LCS of X and Y or to 
compute their LCS length. More involved algorithms fo- 
cuss attention on subsets of the set of matches of X and 
Y, i.e. the set of points (ij) such that X{ = Yj. These 
algorithms may achieve much better time bounds in some 
special cases. However no algorithm is known for the LCS 
problem which achieve a significantly better time bound 
than 0{NM) in the general case, or even in average when 
X and Y are two random strings over an alphabet of size 
S > 2. The fastest known algorithm is described in E3|. 

Moreover relation (|l|) is highly suited for a finite size 
scaling analysis of the LCS length, as it may be easily 
implemented in order to compute in time 0(N 2 ) and space 
O(N) the whole profile of values Li = L^i,l < i < N 
for any given instance. Indeed, to compute the i th line of 
the LCS matrix it is not necessary to have stored all the 
previous lines, since only the (i — l) th line is needed. This 
property also makes the computation of the LCS matrix 
parallelisable to some extent, and a significant speed up is 
obtained in the case of very long strings by implementing 
(m) on a parallel machine. 



J. Boutet de Monvel: Extensive Simulations for Longest Common Subsequences 



5 



2.1 Finite size behaviour of E(Ln)- 

In order to measure the finite size behaviour of the average 
LCS length, we made a direct Monte Carlo evaluation of 
E(Ln) for all TV up to a certain number and over large 
samples of random strings. Namely we computed averages 
of Ljv over 10 5 instances for TV < 1500, and over 10 4 
instances for 1500 < TV < 10 4 . We then extrapolated these 
estimates to the large TV limit by using a x 2 analysis. In 
order to check the extrapolation procedure we performed 
a second series of experiments on a parallel computer, over 
smaller samples of 30 to 50 instances, but for problem sizes 
up to TV = 10 5 . 

We found that a very reliable extrapolation to the large 
TV limit is obtained if one assumes a finite size behaviour 
of the form 



E(L 



N) 



A 



TV 



is 



s 



In TVx/TV 



(4) 



Here An is a negative constant and ejy represents further 
corrections which we expect to be at most 0(1/N). To 
extract the precise asymptotic behaviour (if any) of £n 
would certainly require improvement on the precision of 
our finite size estimates. The statistical precision we had 
on E(L N ), up to TV < 10 4 , was better than 0.002%, and 
further improvement would have been very time consum- 
ing. 

By using a best fit of our TV < 1500 estimates based 
on @ with e N of the form e N = K/(N\n a TV) {K and a 
being constants) we get a surprisingly good extrapolation 
up to values of TV of order 10 5 , which one would not be 
able to obtain by using another form than (0). 

However the form chosen for e^v remains somewhat 
arbitrary. Since the estimation of 75 and As should be 
more precise when extrapolating from larger values of TV, 
we performed a second series of extrapolations, using the 
finite size estimates obtained for 1500 < TV < 10 4 . For 
these values of TV, the term ejv is much less significant, 
and a linear extrapolation of E(Ln) as a function of x = 
1 / (\n Ny/N) is already very precise. Figure (|l|) reproduces 
our results in the cases S — 2, S — 3 and S = 15. The solid 
curves in these figures are best fits of our 1500 < TV < 10 4 
estimates to a linear function of l/(y/~NhiN). In this way 
we obtained the estimates of 7s and A$ which are given 
in table (@) a) for 2 < S < 15. 

To obtain error bars on these estimates one should use 
a x 2 analysis . However this method underestimates the 
true error here. Indeed, x 2 analysis leads to errors for the 
fitting parameters which decrease as l/\/n for n » 1, n 
being the number of degrees of freedom, that is the num- 
ber of independent datas in the fit. Since we computed for 
each instance a whole profile, the averaged points in fig- 
ure (|l|) are not independent: There are correlations in the 
sequence (Li), 1 < i < TV, which results in a smoothing of 
the averaged profile, or equivalcntly in a reduction of the 
"effective" number of independent datas in the fit. More- 
over these correlations are long-ranged, which makes it 
uneasy to measure an effective number of degrees of free- 
dom. We thus relied on a semi-empirical method, by mea- 
suring the range over which the fitting parameters varied 



for different choices of €n- Typically we obtained in this 
way an "error" less than 0.01% on 75 and 5% on As- In 
fact ejv happens to be only slightly larger than the pre- 
cision of our finite size estimates for 1500 < TV < 10 4 . 
We thus expect the above procedure to provide a faith- 
ful (slightly overestimated) measure of the true error on 
our estimates. Rather than quoting semi-empirical error 
bars, table ([]]) gives the results obtained by making re- 
spectively the choices a)ejv = 0, b) ejy ^ Bs/N and c) 
6n ~ -Bs/ (TV In TV). Note that the cases b) and c) agree to 
a better accuracy together than with case a). To determine 
the precise form of the "second order" corrections however 
clearly more precise computations would be needed. 



2.2 The variance of Ln and the universality class of 
the LCS problem. 

It has been observed long ago by Chvatal and Sankoff 
[0 that the variance of the LCS length is numerically 
very small. These authors even conjectured that Var(LN) 
might be o(TV 2 / 3 ). It has been suggested by Talagrand 
(in the context of longest increasing subsequences [p4|), 
that the smallness of Var(L,N) may be related to the 
fact that the number of LCSs of two random sequences 
is very large. The only known general bound is however 
Var(LN) — O(N), an immediate consequence of the con- 
centration inequality (Q). Anyway the work of Hwa and 
Lassig |1| provides a theoretical answer to the conjecture 
of Chvatal and Sankoff: The LCS problem falls into the 
universality class of a model of directed polymer in a 2D 
random potential. The variance of L N in the Bernoulli 
Matching model should grow as TV 2 / 3 . For the Random 
String model we must be cautious with this prediction, but 
it should provide at least a first approximation. The scal- 
ing behaviour of the variance of the LCS length is shown 
in figure (|J), both for the Random String model and for 
the Bernoulli Matching model, and for different values of 
S. We see as expected a very good agreement with the 
scaling law Var(L N ) w TV 2 / 3 for the Bernoulli Matching 
model. The results for the Random String model are more 
interesting: The scaling of Var(LN) is slightly, but clearly 
different from TV 2 / 3 . The correlations among the matches 
should be expected to be more and more relevant as iV 
grows, since 27V independant variables are involved in Ln 
against TV 2 in L N . In fact our results suggest that some- 
thing like a crossover occurs from a small TV scaling regime 
where Var(LN) ~ TV 2 / 3 to an asymptotic scaling regime 
where Var(L N ) ~ N 2uJ ' , u' > 1/3. Note that this asymp- 
totic regime seems not completely reached on figure (||) 
which includes estimates for TV up to 10 4 . Hence the "small 
TV" regime is rather extended. As is apparent on the figure, 
it becomes more and more extended as S increases, and 
the asymptotic regime is more and more difficult to reach. 
For this reason it is difficult to tell if the exponent u>' de- 
pends on S or not. It is also difficult to tell if the numerical 
dependancies of Var(L N ) and Var(LN) respectively on S 
remain reversed in the asymptotic regime. As is seen on 
figure (§A) however, our datas for S = 2 and S = 4 are 
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E (L) /N 



S=2 



0.004 




x 



E (L) /N 



S=3 




.001 



.002 



. 003 



x 



. 004 



E (L) /N 



S=15 




Fig. 1. Extrapolation of the N < 10 4 estimates for E(Ln)/N to the large N limit for S = 2, 3 and 15. The solid curves represent 
best fits to a linear function of x = l/(ln N\/~N). Estimates of E(Ln)/N for 2.10 4 < N < 10 5 , not taken into account in the 
extrapolation, are also included. 
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Table 1. Results of an extrapolation of our finite size estimates (1500 < N < 10 4 ) based on (^) with different choices of ejv. a) 
ejv = 0; b) ejv ~ Bs/N; c) ejv ~ Cs/(Nba.N). The numbers in parentheses represent statistical errors obtained by % 2 analysis, 
in units of the last written digit. 



a) 



8 


7s 


A s 




S 


7s 


As 




2 


0.812282(2) 


-1.6276(5) 




9 


0.493582(3) 


-1.734(2) 




3 


0.717634(3) 


-1.665(2) 




10 


0.474702(2) 


-1.742(1) 




4 


0.654304(11) 


-1.677(7) 




11 


0.458028(2) 


-1.724(1) 




5 


0.607452(4) 


-1.710(3) 




12 


0.443168(3) 


-1.721(2) 




6 


0.570625(3) 


-1.729(2) 




13 


0.429784(3) 


-1.694(2) 




7 


0.540509(2) 


-1.729(1) 




14 


0.417665(3) 


-1.728(2) 




8 


0.515228(3) 


-1.730(2) 




15 


0.406609(4) 


-1.745(3) 




b) 


S 




As 


B S 


S 


7s 


As 


B S 


2 


0.812386(4) 


-1.765(5) 


0.59(2) 


9 


0.493595(13) 


-1.75(2) 


0.10(9) 


3 


0.717637(11) 


-1.67(2) 


0.03(8) 


10 


0.474696(9) 


-1.73(2) 


-0.05(7) 


4 


0.654487(7) 


-1.892(8) 


0.77(3) 


11 


0.458017(9) 


-1.71(2) 


-0.09(7) 


5 


0.607495(20) 


-1.78(3) 


0.33(12) 


12 


0.443176(12) 


-1.73(2) 


0.06(9) 


6 


0.570658(12) 


-1.78(2) 


0.25(8) 


13 


0.429718(10) 


-1.59(2) 


-0.51(7) 


7 


0.540500(9) 


-1.72(2) 


-0.06(7) 


14 


0.417627(13) 


-1.67(2) 


-0.3(1) 


8 


0.515173(10) 


-1.64(2) 


-0.42(8) 


15 


0.406654(16) 


-1.82(2) 


0.34(12) 


c) 


S 


7s 


As 


C s 


S 


7s 


As 


C s 


2 


0.812370(3) 


-1.726(3) 


-2.95(10) 


9 


0.493595(11) 


-1.75(2) 


-0.6(5) 


3 


0.717637(10) 


-1.67(2) 


-0.1(4) 


10 


0.474697(8) 


-1.74(1) 


0.3(4) 


4 


0.654442(4) 


-1.812(4) 


-3.00(7) 


11 


0.458019(8) 


-1.71(1) 


0.4(4) 


5 


0.607490(14) 


-1.76(2) 


-1.8(7) 


12 


0.443175(10) 


-1.73(2) 


-0.3(5) 


6 


0.570653(10) 


-1.77(2) 


-1.3(5) 


13 


0.429728(8) 


-1.62(1) 


2.7(4) 


7 


0.540502(8) 


-1.72(1) 


0.4(4) 


14 


0.417635(11) 


-1.69(2) 


1.5(5) 


8 


0.515182(9) 


-1.67(1) 


2.2(4) 


15 


0.406649(14) 


-1.80(2) 


-1.9(7) 



almost indistinguishable in the range 10 4 < N < 2.10 4 . 
We are thus tempted to conjecture that u' is indepen- 
dent of S 1 , and truly characterizes the universality class of 
the Random String model. Assuming that the asymptotic 
scaling regime is almost reached on figure (|A) leads to 
the estimate w' = 0.418 ± 0.005. 

The distribution of Ln is also of interest. We found 
that the random variable Xn — (Ln-E(Ln))/ ^Var(LN) 
is very nearly normally distributed even at rather small 
values of N. These findings indicate that a central limit 
theorem should apply to the LCS length of two random 
strings, despite the nonlinear growth of V ar{L jy) ■ Figure 
(||) shows the results of a computation in the case of binary 
strings. 

2.3 Computations for the Bernoulli Matching model. 

We have performed similar Monte Carlo simulations for 
the Benoulli Matching model. For computational reasons 
(generating two pseudo-random strings of size ./V is faster 
than a whole N x N matrix), we restricted extensive com- 
putations to sizes N < 1500. We nevertheless performed 
a limited set of computations at sizes up to N — 10 5 , in 
order to check the validity of (||) in that case. We found 
that this finite size scaling law applies to the mean value 



E{L%)/N as well. Using the same method as above we 
obtained the estimates of 7J? which are quoted in table 
(|J) for 2 < S < 15. These are not as precise as the cor- 
responding estimates for the Random String model, since 
the extrapolation was restricted to smaller values of N. 
However we estimate the precision on 7^ to be better 
than 0.1%. More interestingly, we found that the values of 
7J? are very well reproduced by the simple expression 

7 # = 2/(l + >/5), (5) 

a formula which had already been conjectured by Steele 
[ plp2] |. In fact Steele made his conjecture for the original 
LCS problem, at a time where precise numerical estimates 
of 7s where not available, but it happens to be valid for 
the Bernoulli Matching model. 

A short discussion may be instructive. Let Ak be the 
event that there exists a sequence of matches of length k. 
Then the length of a longest sequence of matches is 

N 

£ = E^> ( 6 ) 

k=l 

where 1a is the indicator of set A in the sample space 
fi of the model (be it the random string model or the 
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Random String model 




Fig. 2. Scaling of the variance of the LCS length. Random 
String model: Averages over 10 5 instances for 1 < TV < 1500 
and over 10 4 instances for 1500 < 10 4 . Bernoulli Matching 
model: Averages over 10 4 instances for 1< N < 1500. Dashed 
lines of slope 2/3 give the expected scaling for the Bernoulli 
Matching model. 

Table 2. Estimates of jg for 2 < S < 15. The extrapolation of 
E(L%)/N, N < 1500, was based on (§) with e N ~ C S /(N In N) 
(values obtained for As and Cs are not reproduced). Precision 
on 7s, estimated as the range of variation of our estimates 
for several choices e N ~ K s {N\n a TV) -1 , is about 0.05%. The 
conjectured values 2/(1 + \f§) for j§ are also quoted. 



s 


7f 


2/(1 + VS) 


S 




2/(1 + VS) 


2 


0.82860 


0.828427 


9 


0.50047 


0.5 


3 


0.73236 


0.732051 


10 


0.48082 


0.480506 


4 


0.66698 


0.666667 


11 


0.46383 


0.463325 


5 


0.61823 


0.618034 


12 


0.44850 


0.448018 


6 


0.58030 


0.579796 


13 


0.43484 


0.434259 


7 


0.54892 


0.548584 


14 


0.42223 


0.421793 


8 


0.52291 


0.522408 


15 


0.41077 


0.410426 




-6 -4 1 2 



Fig. 3. (A) Scaling of of Var{L N ) for 10 4 < N < 2.10 4 in 
cases S = 2 and S — 4 (averages over 5000 random strings). 
The solid lines are best linear fits (slope 0.830 for S — 2 and 
0.844 for S = 4). The dashed line has reference slope 2/3. (B) 
Histogram of the values of Xn for S — 2 and N = 500 (aver- 
ages over 10 4 random strings). The solid curve corresponds to 
the normal distribution with mean and unit variance. 

Bernoulli Matching model). Hence the mean value E(L) 
essentially depends on the behaviour of the probabilities 
P(Ak): Using the martingale difference method (see e.g. 
[p2[), one finds that 

P(\L-E(L)\ > k) < 2e~^r (7) 

hence P(A k ) > 1 - 2a^{-{EL - k) 2 /8N) for k < E(L), 
and P{A k ) < 2exp(-(fc - EL) 2 /8N) for k > E(L). 

The location of this transition is very difficult to com- 
pute, but it is clearly related to the behaviour of the ran- 
dom variable A4(w) defined as the number of sequences 
of matches of length k for a given instance us. Clearly 

P(A k ) < E(N k ) = S- k (^j ■ (8) 

Setting k = xN for < x < 1 and using Stirling formula, 
it is found Q] that E(J\f k ) has a transition from exponen- 
tially growing to exponentially decreasing behaviour at a 
value x — xs given by the solution to x(l — x)^ x ^ x = 
S- 1 / 2 . Hence is is an upper bound for 75 and 75 . It is 
not very accurate: One has £2 « 0.9, and as S — ► 00, 
xs ~ e/y/S which is not what one would expect from (||). 
The reason of this failure is that Af k is not a self-averaging 
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the related Longest Increasing Subsequence (LIS) Prob- 
lem. Given a sequence of distinct numbers x±, ...,xjv this 
problem asks for a sequence 1 < i\ < ... < ik < TV such 
that Xi t < ... < Xi k and k is maximal. When the Xi s are 
i.i.d. random variables uniformly distributed in [0, 1], it is 
known that the expected length of a LIS is asymptotic to 
JisVN where 775 = 2 |26|. Now let A/^ 75 ' be the num- 
ber of increasing subsequences of length k of Xi, xn> so 
that 

wn=("M do) 



k k\' 



Using Stirling formula, one finds that E(Afjf S) ) has a tr an- 
sition from a rapidly growing to a rapidly decreasing be- 
haviour at k ~ eV^/V, and presents a sharp maximum at 
k ~ xisVN where xis = 1- Hence 775 = 2xis and the 
above approximation is actually exact in this case. 

As a byproduct, expression (||) provides a consistent 
mean to check the validity of the finite size scaling (||) 
for the Bernoulli Matching model. Indeed we can measure 
directly the scaling in TV of the quantity 



a s (N) = VN(rfjfN-E(L%)). 



(11) 



Fig. 4. Scaling of the finite size corrections to linear growth 
for the Bernoulli Matching model. The figure represents a plot 
of \na(N) defined in the text in function of In In TV (300 < 
TV < 1500) for different values of S. Dashed lines with slope 
— 1 visualize the scaling expected from (0). 



quantity, so that its mean value does not reproduce well its 
typical behaviour. Consider then the "entropy" ln(A4 + l)- 
This is a self-averaging quantity from which 7 C (= 7s or 
75 ) can be computed as the smallest number < 7 < 1 
such that x > 7 implies 



lim 



£ln(A4 



1) 



N 



0. 



(9) 



Clearly the function f(x) = limjv-+oo N' 1 E \n(Af X N + 1) is 
singular at x = j c . From the results of section (|J), we even 
expect f{x) to be discontinuous at x = j c . Unfortunately, 
computing E\n(Afk + 1) is still a difficult problem. 

Steele suggested another approach to the problem Q , 
which consists of looking at the maximum of A//c(w)- The 
location k max of this maximum is a self-averaging quan- 
tity which may be expected to be comparable in a simple 
way with the LCS length: A plausible guess is that with 
probability one, k max /L — > 1/2 as TV — > 00. Assuming this 
we must maximize f(x) defined above, and the situation 
is not much better than before. But now the approxima- 
tion of replacing A4 by its mean value does work much 
better: E(Mk) has a sharp maximum at k ~ xsN, where 
x s = 1/(1 + vS). Hence quite surprisingly, 2x$ is a really 
good estimate for 75, and it happens to give the correct 
value of 7g . We have no explanation for this observation, 
but we remark that a similar computation can be done for 



As is shown in figure (§), In as (TV) has a near linear de- 
pendancc on ln(lnTV) with a slope consistent with —1, as 
is expected by assuming the validity of (||). 



3 The case N ^ M and a cavity solution. 

There is still another way to study the asymptotic be- 
haviour of E(Ln), which consists of working directly with 
the recurrence relation (Q) . This point of view has the ad- 
vantage that it enables one to study the case M ^ TV in a 
natural way, leading to a generalization of (|5|) to the case 
where TV, M — » 00, the ratio r = M/N being fixed. 

In order to find the asymptotics of ([!]) it is convenient 
not to work with Lij directly, but rather (as in ]l3|) with 
the differences vij and Hij defined by 

v%j = Lij - /•- i,,-/'<, = L ij ~ L i,j-i- ^ <i,i < N. 

(12) 

The recurrence relations for and are readily seen 
to be 



Vij — max (0, e^, 
fj,ij = max (0, e.j 



Mt-l.j) v i,j-l — (H—l,j) 
- /',., 1.//,., 1 - (13) 



with boundary conditions 1^0 = VQ,i = Mi,0 = Mo,i = 0. In 
the Random String model we have = <^Xi,K, , whereas in 
the Bernoulli Matching model the e^-'s are i.i.d. Bernoulli 
variables with P(e« = 1) = 1 - P(e^ = 0) = 1/5. We 



consider relations (13) as a kind of exact cavity equations 
[[is) for the LCS problem. The LCS length Ljy can be 
retrieved by summing the Ky's and /x^'s along the first 
bissector. To be precise 



N 

Ln — ^ (f-ii 



N 

E 

i=i 



(14) 
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When TV ^ M, M/N — r, we view Lmm as a sum along a 
path "as straight as possible" in the direction defined by 
r, e.g. a path zigzaging along the straight line joining the 
points (0, 0) and (N, M) in such a way as to keep as close 
as possible from this line. 

A simple, but important observation is that the vari- 
ables Vij and fiij can take only the values and 1. Hence 
let us introduce the probabilities 



Pij = P(fij = 1), 



Pi 



1) 



such that (2,j) is a match, and so on. The differences 
jk+i — jk are independent random variables with mean 
value 5. By the law of large numbers, jk is asymptotic 
to kS as k — > co. It follows that the length of this se- 
quence of matches, restricted to the integer points (ij) 
such that j < M, is asymptotic to M/S — N as N — > 00. 
Hence for r > 5 we have 75 (r) = 75 (r) = 1 (and also 
7s(l/r) = 75(1/^) = 1/r by symmetry). This means 
that when i is large and j > Si, Lij is nearly equal to 
(15) i i hence for each %' < i and j' > j we have vyy = 1 and 



As i and j —>■ 00, it is natural to expect that p^ and p'^ 
have limits depending only on the ratio r — j/i. We denote 
these limits respectively by p(r) and p'(r) (or p and p 1 for 
short). 

For a given r > 0, Lj^t r m is a sum of N terms 
and vN terms fiij, along a path "close" to the straight 
line from (0,0) to (N,[rN}). Hence the limit 7s (r) = 
limjv—,00 E(Ljsr t [ r N])/N must be given by 



7s (0 = p(r) + rp'(r). 



(16) 



Now a relation between p(r) and p'(r) can be readily ob- 
tained from (^||) if we assume that for large i and j, the 
occupation numbers v\ j and fiij are nearly independent 
variables. It turns out that this decorrelation property 
holds true in the Bernoulli Matching model. It can be 
justified on the basis of a transfert matrix method for this 
percolation problem which will be presented elsewhere Q . 
In the limit i, j : — > 00 we are thus led to the following self- 
consistent equations: 

Pij = 1 - Pi-x,j ~ U - 1 /S){1 - />,., 1 i'M - p-.ij), 
Pij = 1 - Pij-i - (1 - - - Pi-i,i)(17) 

If we now let p(r) = limi_ >00 p i) [ r . i ] andp'(r) = lim^ooP^] 
and note that ^ = p it [ ri ]-l/i(d/ dr)p{r) and = 
^ [ri] r/i(d/dr)p'(r) up to negligible terms in the limit 
i — > 00, then taking the sum and the difference in ( p"7| ) 
leads to 

l=p + p' + (S-l)pp' (18) 

and 

i. p ( r)+r |.y (r ) = 0. (19) 
dr dr 

These last two equations determine the functions p(r) and 
p'(r) completely. A simple computation gives now 



p(r) = 



frS-l 
5-1 ' 



p\r) = 



5-1 



(20) 



Note that the relation p'(r) = p(l/r) is obvious from sym- 
metry considerations. It must be also remarked that (20) 
is only satisfied for 1/5 < r < S (although @ and @ 
are valid for all r except r = 5 and r = 1/5): The LCS 
problem has a percolation transition when one of the two 
strings is 5 times larger than the other. Suppose for in- 
stance r = M/N — S. Consider the sequence of matches 
(1, ji), (2, j 2 ), where j\ is the smallest integer j > 1 
such that (I, j) is a match, ji is the smallest integer j > ji 



fii'j' = with high probability. In other words, r > 5 im- 
plies p(r) — 1 andp'(r) = 0, an d by symmetry, p(l/r) = 
andp'(l/r) = 1. From (|20| ) and (16) we find the expression 
of the function 7J? (r) of the Bernoulli Matching model for 
1/5 < r < 5: 



7#(0 



2Vr5 



1 



(21) 



Note that the transition of 7^ (r) at r = 5 and r = 1/5 is 
"second order", that is djg /dr = p'(r) is continuous and 



d^/dr 



is discontinuous at 



5 and 



1/5. 



Figure (|5|) shows the confrontation of equation (|21 
to a Monte Carlo computation of the Bernoulli Matching 
model for 5 = 2 and 5 = 15. 

We have plotted, for several values of t, the "curves" 
delimiting the set Ct/t in the two dimensional (x,y) plane, 
where Ct = {(ij) '■ 1 < i,j < N, L,j < t}. As t — * 00, the 
boundary of Ct ft approaches asymptotically the curve of 



parametric equation 



This is the 



solid curve which we have plotted using (|2l|). Figure (|6j) 
reproduces for comparison the results of analogous com- 
putations made for the Random String model. Note that 
as 5 increases, the differences between the results for the 
Bernoulli Matching model and the random string model 
are less and less significant, and it is reasonable to expect 
that js(r) is asymptotic to 75 (f) as 5 — * 00. Numerically 
the convergence is rather rapid: the quantity 5(7^ — 75) 
shows a maximum at 5 ~ 11 after which it happens to 
decrease. Such a phenomenon has already been observed 
and interpreted in other combinatorial optimisation prob- 
lems II , and it would be of interest to have a theoretical 
understanding of the large 5 behaviour of 7J? — js- We 
leave this question open for future work. 



4 Configuration space properties of the LCS 
problem. 

In this section we study generic properties of the set of 
solutions of the LCS problem, that is average properties 
of the set of all LCSs of two random strings. 

A most direct computational access to these proper- 
ties is provided by what we shall call the LCS graph of a 
given instance. Given any strings X and Y of length N, 
this graph is defined as follows. The vertices are the LCS 
matches, that is the set of points (ij), 1 < i, j < N such 
that Xi = Yj and (ij) occurs in at least a LCS of X and 
Y. Two LCS matches are incident in the LCS graph if 
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1.2 1.4 1.6 1.8 




1.2 1.4 1.6 





Fig. 5. The boundary shape of the set Ct/t for the Bernoulli 
Matching model for S = 2 (t = 100, 500, 1000, 2300) and S = 
15 (t — 100, 300, 700). Each dotted line represents an average 
over 1000 instances of size N ((N,N) Bernoulli matrices) with 
N = 3000 for S = 2 and N = 2000 for S = 15. The solid curve 
is the asymptotic shape predicted from (fell). 



Fig. 6. The boundary shape of the set Ct/t for the 
LCS problem (Random String model) for S = 2 (t = 
100, 500, 1000, 1500, 2000) and S = 15 (t = 100, 300, 500, 1100). 
Each dotted line represents an average over 1000 random 
strings of size N = 3000. The solid curve, plotted for com- 
parison, is the asymptotic shape predicted from (^l|) for the 
Bernoulli Matching model. 



they occur as successive matches (regardless the order) in 
the same LCS. 

It is a nice feature of the LCS problem that this struc- 
ture may be computed in a very efficient way. To a large 
part, this circumstance is due to the directed nature of 
the problem, which greatly simplifies the structure of the 
set of solutions. 



4.1 Construction of the LCS graph. 



Since the construction we have used is rather simple we 
shall not give a precise algorithm, but rather indicate the 
main steps, together with the main observations which 
enable an efficient implementation. 

Given integer points («iji) and (^2^2) we write (hji) -< 
(^2^2) if i\ < ii and 12 < ji- Suppose the LCS matrix of 



X and Y is computed, and let L be the length of a LCS 
of X and Y. Following the terminology of B , we call an 
integer point (ij) such that Xj = Yj a match of rank k if 
k is the length of a LCS of X\, ...,_Xj and Y\,...,Yj, It is 
then easy to construct, for each 1 < k < L, a list M{k) 
of the matches of rank k of X and Y . It is convenient 
to have the members (ij) of M(k) ordered lexicographi- 
cally, in such a way that i and j vary in opposite direc- 
tions, e.g. i increasing while j is decreasing. Then setting 
M(k) = {(«iji), (i mk jm k )}, one sees that (i x , ...,i mk ) is 
an increasing sequence, while (ji, ...,j mk ) is a decreasing 
sequence. The reason for this is that given any two mem- 
bers (ij) and (i'j') of M(k) we have i < i' j > j' , since 
otherwise (ij) and (i'j') would not be of the same rank. 
This property is important for an efficient construction of 
the LCS graph. 

The lists M (k) are the basic data in the construction 
of the LCS graph. Remark that the members of M(L) are 
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obviously LCS matches, hence these must be included as 
vertices of the LCS graph. If P is a match of rank k < L, 
then P is a LCS match if and only if there is a LCS match 
Q of rank k + 1 such that P -< Q. Remark also that, by 
definition, a LCS match of rank k may be connected only 
to LCS matches of rank k — 1 or k + 1 in the LCS graph. 
If P is a LCS match of rank k > 1, and Q is a LCS match 
of rank k — 1, then P and Q are connected if and only if 
Q -< P. We will denote by M LC s(k) the list of the LCS 
matches of rank k, ordered in the way which is inherited 
from the ordering of M(k). 

We construct the LCS graph in L stages numbered 
k = L,L — 1, 1. Stage L consists of inserting all matches 
of rank L as vertices of the LCS graph. Once all the LCS 
matches of rank > k have been inserted, stage k consists of 
selecting the members of M(k) which belong to MLCs(k), 
and then to insert the required edges connecting M^cs{k) 
to M LCS (k + 1). 

Using remarks made previously and exploiting the way 
M{k) and M^csik + 1) have been ordered, it is easy to 
see that the selection of the members of MLCs(k) from 
those of M(k) at stage k may be performed in 0(rrik + 
ifc+i) steps, nik and lk+i being the cardinality of M{k) 
and M^csik + 1) respectively. Hence the detection of the 
whole set of LCS matches takes at most 0(m) steps in 
this construction, m = J^ fe rrifc being the total number of 
matches of X and Y. The main part of the computation 
is devoted to the insertion of the edges in the LCS graph. 
The number of operations (comparisons and insertions) 
needed to determine the edges connecting MlcsW an d 
MLcs(k + 1), once these lists are known, is of order 0(1%). 
Since there is no obvious bound for Ik better than m&, 
and no obvious bound for m& better than 2N, we obtain 
a bound for the time required to compute the LCS graph 
which is 0{LN 2 ). 

However when X and Y are random strings from a 
finite alphabet, the typical values of Ik happen to be much 
smaller than m^, and the typical time required by the 
above construction is in fact much smaller than 0(LN 2 ). 



4.2 Computations of the LCS graph. 



400 600 



00 1000 1200 



00 1000 1200 



Fig. 7. Mean value (A) and variance (B) of the ground state 
entropy Sn = lnA/ics as a function of N, for S — 2 (Random 
String model, averages over 10 4 instances). 



also nearly linear. We observed this behaviour for all the 
values of S we considered. 

Hence we found that the number of LCSs of two typical 
random strings is very large. Nlcs typically grows expo- 
nentially with N, with a well defined exponential factor 
as, which we define, assuming the limit indeed exists, as 



We performed a series of Monte Carlo computations in 
order to study some of the basic properties of the set of 
LCSs of two random strings. We concentrated our study 
on different quantities which can be easily computed once 
the LCS graph is constructed. 

Probably the most basic quantity which characterizes 
the set of LCSs is its cardinality Nlcs- Figure (0) repro- 
duces the estimated average and variance of the ground 
state entropy Sn = \tiNlcs in casc S = 2, computed 
over 10 4 random instances and for values of N ranging 
from 100 to 1000. It is rather striking on this figure that 
E(Sn) grows linearly with N. We expect the random vari- 
able In Nlcs to be self-averaging, and this is confirmed by 
the measured behaviour of its variance, whose growth is 



as 



lim 

N— >oo 



E(S. 



N 



N 



Also we define (provided the limit exists) 

Var(S N ) 



Ps 



lim 



N 



(22) 



(23) 



Using best linear fits we obtained rather precise estimates 
of as and /3g, which are quoted in tabic (|^) for several 
values of S. 

Another quantity reflecting the "size" of the set of 
LCSs of two random strings is the typical overlap of two 
LCSs. Viewing a LCS of X and Y as a sequence of integer 
points we define the overlap of two LCSs o~\ = (Qi, Ql) 
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and (72 = (-Pi 5 ■■•,Pl) as the quantity 



q = q(<J 1 ,a 2 ) = j- ^ S(Q k , P k ). 



(24) 



fc=i 



where 5(Qu, Pk) = 1 if Qk = Pk and otherwise. q(<J\, a 2) 
is analogous to the order parameter used in the theory 
of spin glasses The quantity L(l — q(o\,(J2)) should 
be regarded as a kind of Hamming distance in the space 
of LCSs of X and Y . The object of interest here is the 
empirical distribution of (7(0-1, 02) for o\ and 02 ranging 
over the set of LCSs of X and Y. We denote by < q > 
and < q 2 > the first and second moment of the overlap 
under this distribution. It is not difficult to see that 



< q >-- 



where 



tX E *«) a 

}eM LCS (k) 

Mlcs(Q) 



k=l 



Pi(Q) 



(25) 



(26) 



Mlcs ' 

and Nlcs{Q) is the number of LCSs of X and Y of which 
the integer point Q is a member. Hence the average overlap 
< q > can be easily computed for any given instance of 
X, Y once the LCS graph is constructed. Also we have 



1 



<? 2 >=7>EE E E 

fe=l 1=1 QeM iCS (fc) Q'£M LCS (l) 



where 



P2(Q,Q') = 



Mlcs(Q,Q') 



p 2 (Q,Q') 2 

(27) 
(28) 



and Mlcs{Q: Q') is the number of LCSs of X and Y of 
which points Q and Q' are members. It is still elementary 
to compute < q 2 >, but more computationally lengthy 
due to the above double summation. 

We denote the averages of < q > and < q 2 > over the 
random strings X and Y simply by E(q) and E(q 2 ). Figure 
@ presents the results of a Monte Carlo computation of 
E{q) and Var(q) = E{q 2 ) - (Eq) 2 in the case S = 2. This 
figure shows that E(q) has a nearly 1/y/W convergence to 
a limit value qs as N — > 00 . Not surprisingly in view of the 
fact that Nlcs grows exponentially with TV, we find that 
qs < 1. Estimates of qs based on a 1/yN extrapolation 
of our finite size results are given in table (||). 

It is also seen on figure (||) that the variance of the 
overlap decreases with TV roughly as 1/N. Hence we con- 
clude that the overlap q{a\,G2) of two randomly chosen 
LCSs happens to be self-averaging, i.e. q{a\,a2) becomes 
non random (and equal to qs) in the limit N — > 00. This 
is in fact not surprising: the space of LCSs of two random 
strings is not very far from having a product structure and 
the quantity (1 — q(cri, 02)) is a kind of (normalized) Ham- 
ming distance on this space. In the conventional wisdom 
of statistical mechanics, we would say that, although there 
is some pathology in this system from a physical point of 
view (it does not satisfy "Nernst's principle"), it presents 
no replica symmetry breaking. 




0.02 0.0 



1/N» (1/2) 



Var(q) 



0.01 0.02 0.03 0. 



0.05 0. 



Fig. 8. (A) The average overlap E(q) of two random LCSs 
as a function of 1/VN (100 < N < 1000, averages over 10 4 
random strings). (B) The variance Var{q) = E(q 2 ) - (Eq) 2 of 
q as a function of 1/N (10 < N < 100). Statistical error bars in 
(A) were obtained from estimates of the standard deviation of 
< q >, not to be confused with the overall standard deviation 
\/Varq which is larger and is much more lengthy to compute. 



Table 3. The exponential growth factor of the number of LCSs 
of two random strings and the average overlap between two 
LCSs. 



s 


as 


Ps 


qs 


2 


0.2458(8) 


0.232(2) 


0.6753(8) 


3 


0.2302(4) 


0.171(1) 


0.6782(8) 


4 


0.2086(3) 


0.145(2) 


0.6851(7) 


5 


0.1903(2) 


0.125(2) 


0.6921(7) 


10 


0.1365(2) 


0.0885(1) 


0.7138(10) 


15 


0.1100(1) 


0.0711(1) 


0.7264(8) 
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We also considered quantities which are of interest to 
describe the "shape" of the LCS graph. Two such quanti- 
ties are the distribution of the distance between two suc- 
cessive matches of a LCS, and the distribution of the num- 
ber of LCS matches of a given rank. More precisely, we let 
V(d, X, Y) be the empirical distribution, over the set of 
LCSs of X and Y, of the distance between two successive 
LCS matches: 



V(d,X,Y) 



L-l 



L-l 

E 

fc=l Q£M LCS (k) 



^LCs(Q,d) 



LCS 



(29) 



Here AfLCs(Q,d) is the number of LCS a — (Qi, Ql) 
of X and Y such that Qk — Q for some k < L, and 
\Qk+i — Qk\ = d (the distance between two points is taken 
to be Manhattan distance — (*2 J2 ) | = — ^l + lii — 

j2 1 ) - We define Vs(d, N) as the average of V(d, X, Y) over 
random S-siy strings of size N . Also we let 7T(m, X, Y) be 
the empirical distribution of the cardinality of M^csik) 
over 1 < k < L, i.e. 



1 L 

n(m,X,Y) = -Y,5(l k ,m), 

k=l 



(30) 



Ik being the number of LCS matches of rank fc, and we 
let IIs(m,N) be the average of II(m, X,Y) over X and 
Y. It is natural to expect that Vs(d, N) has a limit Vsid) 
as N — > oo. It is not so obvious that the same holds for 
7Ts(m, N). We found numerically that both Vs{d, N) and 
lis (I, X) approach well defined distributions as N grows. 
Figure (j|) reproduces graphically Vs{d) for S = 2,4,10 
and 15. As S increases the maximum of Vs(d) becomes 
more and more pronounced and is displaced to the right, 
as is expected from the relation J2 d>0 dVs(d) — 2/75. 
The asymptotic shape of i7s(m) appears to depend much 
less drastically on S so we give only the results obtained 
for S = 2 and S — 4 (figure (|To[) ) . Numerically it is found 
that the typical number of LCS matches of a given rank 
remains bounded as N grows. 

This contrasts with the behaviour of the diameter of 
the sets M^csik) (in the Manhattan distance). This be- 
haviour is shown in figure Figure (|ll]), where are plotted 
the quantities Ds{N) and Vs(N), defined to be respec- 
tively the mean and variance over random 5*-ary strings 
of size N of 



L 



D S (X,Y) = ^^dicu,,[M L r, <k) 

k=l 



(31) 



Clearly Ds(N) appears to grow with N. In fact from 
heuristic scaling arguments, we expect Ds(N) to be of the 
same order as the finite size corrections to the linear scal- 
ing of E(Ln). If we are confident in (|j), this means that 
Ds(N) should grow asymptotically as ^/N/hiN. Fortu- 
nately this is what we find from a y 2 analysis: The solid 
curve in figure (|TT|) (A) is a best fit of our estimates to a 
function of the form C1+C2VN / \nN. The corresponding 
X 2 value is 12, 74 for a number of degrees of freedom of 



2 4 6 



Fig. 9. The distribution Vs{d) of the distance between two 
successive LCS matches, for S = 2, 4, 10, 15 (averages over 10 4 
random strings in each case) . Each figure show results for differ- 
ent values of 100 < N < 1500 superposed in order to visualize 
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Fig. 10. The distribution IIs(m) of the number of LCS 
matches of a given rank for S = 2 and S = 15 (averages over 
10 4 random strings in each cases). 



Fig. 11. Behaviour of (A) the mean Ds(N) and (B) the vari- 
ance Vs{N) of the width of the LCS graph (averages over 10 4 
random 15-ary strings). 



13. For comparison, the x 2 value achieved from a best fit 
to C\ + Ci\f~N is of 37.7, which is much too large. This 
numerical test provides another support to the reliability 
of . Note however that the fluctuations of D$ (X, Y) are 
far from negligible, as the variance of Ds(X, Y) shows a 
near linear growth. 

The asymptotic distributions Vs(d) and Ils(m) pro- 
vide useful informations on the local properties of the LCS 
graph, but they tell nothing about the universality class 
of the LCS problem. Results for the mean square "dis- 
placement" i — j along the LCS graph are presented in 
figure ( |r^ ) . One way to measure this quantity would be to 
generate a given LCS in a sequential way and to perform 
averages along this LCS Jll| . Since we are able to perform 
exact averages over the set of LCSs, we use here a more 
"static" definition: For a given instance, the mean square 
displacement along a LCS chosen at random is given by 



<(*-j) 2 >= Z E E P i(Q)(*-3) 2 (32) 

k=l Q£M LCS (k) 



where P\{Q) is defined as before and we have set Q = 
(ij). We then estimate E((i — j) 2 ) as an average over a 
large number of random strings of < (i — j) 2 > com- 
puted for each instance. The price to pay for exact com- 
putations is mainly a limitation on the size N of our in- 
stances. It is seen on figure ([j^) however that the scal- 
ing behaviour E((i ~ j) 2 ) m TV 4 / 3 is reached rather fast, 
both for the Bernoulli matching model and for the Ran- 
dom String model. We cannot exclude the possibility of a 
crossover at N 1500 for the Random String model. But 
then the asymptotic scaling regime of E((i — j)) 2 would 
be attained at much larger values of N than for Var(L^), 
which seems unlikely. 



5 Concluding remarks. 

This article has been devoted to the presentation of a thor- 
ough investigation of the LCS Problem by means of nu- 
merical simulations. One of our main findings is that the 
finite size behaviour of the average LCS length E(Lm) is 
very well reproduced by (^) . This form provides a numeri- 
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In E(<i-j)"2) 



Fig. 12. Scaling of the average "displacement" E((i — j) 2 ) 
along the LCS graph, for the Random String model (RS) 
(100 < N < 1500) and for the Bernoulli Matching model (B) 
(100 < N < 1000). Averages are over 10 4 instances in each 
case, with 5 = 2. The TV 4 / 3 scaling is visualized by the dashed 
line of slope 4/3. 



cally trustworthy method of extrapolation, from which we 
have improved significantly the precision on previous esti- 
mates of the limit ratio 7s- It is very difficult at present to 
find any theoretical insight which could justify ([|). Even 
improving on Alexander's rate result seems very difficult. 
It could be useful in this respect to have a better under- 
standing on the effects of boundary conditions in these 
kind of problems. 

We also studied a related model where the two strings 
are replaced by a matrix of i.i.d. Bernoulli variables indi- 
cating the locations of the matches. We obtained a sim- 
ple analytic expression ( ]2l| ) for the "passage time" func- 
tion -fg (r) of this Bernoulli Matching model. This expres- 
sion compares very well with our numerical results, and it 
provides also an excellent approximation for the Random 
String model. As this approximation becomes more and 
more accurate as S becomes large, a natural question is 
then whether one could evaluate some corrections induced 
by the correlations among matched points in the Random 
String model. 

A further interesting question concerns the applica- 
bility of the cavity-like method used to derive (|2l|). What 



makes this method work for the Bernoulli Matching model 
is that a remarquable decorrelation property holds in this 
percolation problem fl. It would be interesting to find 
other percolation problems where such a decorrelation prop- 
erty occurs. This would provide simple means to obtain 
analytical information on the passage time constants of 
such models. 

We finally investigated average properties of the set of 
solutions, and the "universality class" of the LCS problem. 
We were rather surprised to find that the number of com- 
mon subsequences of maximal size of two typical random 
strings grows exponentially with the size of the strings. 
It follows that two (randomly) given LCSs are to a large 
extent distinct, as confirmed by the study of their typical 
overlap. We also found that the long ranged correlations 
in the Random String model appear to be relevant to the 
universality class of the model, as is seen from the large 
N behaviour of Var{Ljsi). One may wonder why this has 
not been observed in Needleman-Wunsch sequence align- 
ment p2[ . A plausible reason (pointed out in is that 
introducing a gap penalty in the model results in bind- 
ing more tightly the optimal paths to the first bissector. 
This should reduce the effect of correlations and extend 
the "small N" scaling regime to larger values of N. In 
particular for biological purposes only the small N regime 
is likely to be relevant. An exciting issue is the possible 
occurence of a phase transition in the gap parameter of 
Needleman-Wunsch alignment. 

Another interesting question is whether a proliferation 
of solutions is specific to random sequences and subse- 
quences, or if such phenomenon is of relevance to other 
percolation situations. As already said, the smallness of 
the variance of Ljq is probably related to the large num- 
ber of LCSs of two random sequences. Smallness of the 
variance of the passage time from (0, 0) to (0, N) is also 
observed in usual first passage percolation on Z . In fact 
these models (a famous example of which is the Eden 
model) are known to fall into the universality class of di- 
rected polymers in random media. One may expect to find 
in these models a large number of quasi optimal paths with 
typical overlaps smaller than one. 
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